Add new pre-trained models BERTweet and PhoBERT #6129

Merged
merged 34 commits into huggingface:master on Sep 18, 2020

Conversation

datquocnguyen
Contributor

@datquocnguyen datquocnguyen commented Jul 29, 2020

I'd like to add pre-trained BERTweet and PhoBERT models to the transformers library.

Users can now use these models directly from transformers, e.g.:

bertweettokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweetmodel = BertweetModel.from_pretrained("vinai/bertweet-base")

phoberttokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-large")
phobertmodel = PhobertModel.from_pretrained("vinai/phobert-large")
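
A minimal end-to-end sketch of using one of these pairs (hypothetical snippet; the example sentence is illustrative, and PhoBERT expects word-segmented Vietnamese input):

import torch

# Illustrative only: encode a word-segmented Vietnamese sentence and run the model.
input_ids = torch.tensor([phoberttokenizer.encode("Tôi là sinh_viên trường đại_học Công_nghệ .")])
with torch.no_grad():
    features = phobertmodel(input_ids)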

BERTweet: A pre-trained language model for English Tweets
PhoBERT: Pre-trained language models for Vietnamese

@julien-c julien-c added the model card Related to pretrained model cards label Jul 29, 2020
Re-add `bart` to LM_MAPPING
Re-add `from .configuration_mobilebert import MobileBertConfig`
not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`
@datquocnguyen datquocnguyen changed the title from "Add BERTweet and PhoBERT models" to "Add new pre-trained models BERTweet and PhoBERT" on Jul 29, 2020
datquocnguyen and others added 3 commits July 30, 2020 09:10
Remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer).
@datquocnguyen
Contributor Author

datquocnguyen commented Jul 30, 2020

I'd like to add pre-trained BERTweet and PhoBERT models to the transformers library.

Users can now use these models directly from transformers, e.g.:

bertweettokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
bertweetmodel = BertweetModel.from_pretrained("vinai/bertweet-base")

phoberttokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-large")
phobertmodel = PhobertModel.from_pretrained("vinai/phobert-large")

BERTweet: A pre-trained language model for English Tweets
PhoBERT: Pre-trained language models for Vietnamese

Could I get any support from Hugging Face w.r.t. this pull request, @julien-c? Thanks.

@LysandreJik LysandreJik self-requested a review July 31, 2020 08:53
@LysandreJik
Member

Hello @datquocnguyen! As you've said, BERTweet and PhoBERT reimplement the RoBERTa model without adding any special behavior. I don't think it's necessary to reimplement them then, is it? Uploading them to the hub should be enough to load them into RoBERTa architectures, right?

@datquocnguyen
Contributor Author

Hi @LysandreJik
They use different tokenizers (i.e. fastBPE), so their tokenizers cannot be loaded with the RoBERTa tokenizer.
Please see a loading example using RoBERTa: https://github.com/VinAIResearch/BERTweet#transformers
An issue related to this is at: #5965

@datquocnguyen
Contributor Author

datquocnguyen commented Jul 31, 2020

I hope both BERTweet and PhoBERT can be incorporated into transformers in a similar manner to their counterparts (e.g. CamemBERT and FlauBERT). @LysandreJik Please let me know what I can do for this. Thanks.

@LysandreJik LysandreJik self-assigned this Jul 31, 2020
@LysandreJik
Member

Yes, I understand, that makes sense. There shouldn't be any issue in incorporating them into transformers.

@LysandreJik
Member

I've taken a quick look at it, and it looks very cool! Something we could maybe do better concerns the tokenizers:

  • They're currently untested, but they're the main contribution of this PR so they definitely should be tested.
  • If possible, we would prefer not to add an additional dependency (in this case fastBPE). It would be great to leverage the existing huggingface/tokenizers library.
  • On that front, given it's a BPE tokenizer, it should be easy enough to leverage the OpenAI GPT (not GPT-2) tokenizer, which seems very similar. It might even be possible to load the vocab/merge files directly in OpenAIGPTTokenizer.
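
For instance, a rough sketch of that last idea (purely hypothetical; it assumes the fastBPE vocab and merges files have first been converted into the vocab.json / merges.txt formats that OpenAIGPTTokenizer expects):

from transformers import OpenAIGPTTokenizer

# Hypothetical: the file names are placeholders for converted fastBPE artifacts.
tokenizer = OpenAIGPTTokenizer(vocab_file="vocab.json", merges_file="merges.txt")
print(tokenizer.tokenize("a sample tweet to sanity-check the BPE merges"))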

Let me know what you think!

@LysandreJik
Member

I haven't tried it directly, but as discussed with @n1t0, since you're not doing any fancy pre-processing it might be as simple as the following:

from tokenizers import CharBPETokenizer
from transformers import PreTrainedTokenizerFast

class PhobertTokenizerFast(PreTrainedTokenizerFast):
    # VOCAB_FILES_NAMES etc. are assumed to be defined elsewhere in the module.
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["attention_mask"]

    def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
        kwargs.setdefault("unk_token", unk_token)
        # Wrap a whitespace-split, case-preserving character-level BPE tokenizer.
        super().__init__(
            CharBPETokenizer(
                vocab_file=vocab_file, merges_file=merges_file, unk_token=unk_token,
                lowercase=False, bert_normalizer=False, split_on_whitespace_only=True,
            ),
            **kwargs,
        )
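
A hypothetical usage of the sketch above (file names are placeholders; the fastBPE files would first need converting to the vocab.json / merges.txt formats that CharBPETokenizer reads):

fast_tokenizer = PhobertTokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")
print(fast_tokenizer.tokenize("Tôi là sinh_viên"))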

@datquocnguyen
Contributor Author

Thanks very much @LysandreJik. I will revise the code following your comments and let you know as soon as it's done.

@JetRunner
Contributor

JetRunner commented Aug 1, 2020

@datquocnguyen Yeah, these models are cool. Lovin' it. I think we can try to figure out how to convert the fastBPE format to a compatible format before adding it directly to our dependencies (I believe XLM uses fastBPE), so would you hold on a little while we figure it out? We have to be cautious when adding dependencies! Thanks!
cc @LysandreJik

@datquocnguyen
Contributor Author

Yes. Thanks @JetRunner

@justinphan3110

Some tokenizer functions (decode, convert_ids_to_tokens) haven't been implemented for PhobertTokenizer yet, right?

@Miopas

Miopas commented Aug 10, 2020

@datquocnguyen Thank you for this pull request. I tried the BERTweet model and ran into a problem: the tokenizer did not encode special symbols like "<pad>" as a whole token. Instead, it would split the string into characters like "< p a d >". I fixed the problem by modifying the code in tokenization_bertweet.py as below:

--- a/BERTweet/transformers/tokenization_bertweet.py
+++ b/BERTweet/transformers/tokenization_bertweet.py
@@ -242,9 +242,14 @@ class BertweetTokenizer(PreTrainedTokenizer):
             text = self.normalizeTweet(text)
         return self.bpe.apply([text])[0].split()

-    def convert_tokens_to_ids(self, tokens):
-        """ Converts a list of str tokens into a list of ids using the vocab."""
-        return self.vocab.encode_line(" ".join(tokens), append_eos=False, add_if_not_exist=False).long().tolist()
+    def _convert_token_to_id(self, token):
+        #""" Converts a list of str tokens into a list of ids using the vocab."""
+        #return self.vocab.encode_line(" ".join(tokens), append_eos=False, add_if_not_exist=False).long().tolist()
+        return self.vocab.encode_line(token, append_eos=False, add_if_not_exist=False).long().tolist()[0]
+
+    @property
+    def vocab_size(self) -> int:
+        return len(self.vocab)

From my understanding, to encode a sentence the interfaces are called in this order: PreTrainedTokenizerBase::encode
-> PreTrainedTokenizer::_encode_plus
-> PreTrainedTokenizer::convert_tokens_to_ids
-> PreTrainedTokenizer::_convert_token_to_id_with_added_voc
-> BertweetTokenizer::_convert_token_to_id for non-special tokens, or PreTrainedTokenizer::added_tokens_encoder for special tokens.
So BertweetTokenizer should implement the interface _convert_token_to_id rather than override convert_tokens_to_ids.
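
A quick, hypothetical way to check the fix (assumes the patched BertweetTokenizer from this PR is installed):

from transformers import BertweetTokenizer

tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base")
# With _convert_token_to_id in place, a special token should map to a single id
# instead of being split into per-character pieces.
print(tokenizer.convert_tokens_to_ids(["<pad>"]))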

@datquocnguyen
Contributor Author

I will have a look soon. Thanks @Miopas.

@SergioBarretoJr

I have just tried "BertweetTokenizer" and got this error:

"ImportError: cannot import name 'BertweetTokenizer' from 'transformers' (/home/apps/anaconda3/lib/python3.7/site-packages/transformers/init.py)"

Is there any solution to it?

I have also tried:

tokenizer2 = BertTokenizer.from_pretrained("vinai/bertweet-base")
trained = tokenizer2.encode("oops!! pelosi & dems admit numbers submitted to cbo are false! someurl #tcot #tlot #sgp #hcr #p2")

and got:
trained = [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

Is there any solution to it?

Thanks!

@codecov

codecov bot commented Sep 17, 2020

Codecov Report

Merging #6129 into master will decrease coverage by 0.24%.
The diff coverage is 71.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6129      +/-   ##
==========================================
- Coverage   80.32%   80.08%   -0.25%     
==========================================
  Files         168      170       +2     
  Lines       32285    32642     +357     
==========================================
+ Hits        25932    26140     +208     
- Misses       6353     6502     +149     
Impacted Files Coverage Δ
src/transformers/tokenization_bertweet.py 63.18% <63.18%> (ø)
src/transformers/tokenization_phobert.py 83.45% <83.45%> (ø)
src/transformers/__init__.py 99.34% <100.00%> (+<0.01%) ⬆️
src/transformers/tokenization_auto.py 92.06% <100.00%> (+0.26%) ⬆️
src/transformers/modeling_tf_t5.py 26.05% <0.00%> (-63.52%) ⬇️
src/transformers/modeling_tf_gpt2.py 71.84% <0.00%> (-23.17%) ⬇️
src/transformers/modeling_lxmert.py 70.01% <0.00%> (-20.75%) ⬇️
src/transformers/modeling_transfo_xl_utilities.py 52.98% <0.00%> (-13.44%) ⬇️
src/transformers/modeling_transfo_xl.py 67.10% <0.00%> (-12.67%) ⬇️
src/transformers/tokenization_roberta.py 87.67% <0.00%> (-10.96%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0cbcdb...257b9f1. Read the comment docs.

@napsternxg

@datquocnguyen can you also upload your model files to https://huggingface.co/vinai/bertweet-base?

I still get this error:

⚠️ Model name 'vinai/bertweet-base' was not found in tokenizers model name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). We assumed 'vinai/bertweet-base' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.json', 'merges.txt'] but couldn't find such vocabulary files at this path or url.

@napsternxg

napsternxg commented Sep 17, 2020

@datquocnguyen I looked at the PR and am looking forward to this merge. I have a few suggestions:

  1. I find the Phobert and Bertweet models to be quite similar. This makes the tokenizers also similar, so we should not need a separate tokenizer for both. Given that both these tokenizers just load the fastBPE tokenizer data format, we could simply call them fastBPETokenizer.

  2. Looking at this other code which also uses fastBPE, can't we just follow it to convert the fastBPE tokenizer files to the Hugging Face format (see the sketch after this list)?

    • You can easily convert your bpe.codes into a merges.txt file and then use the RoBERTa tokenizer.
    • The format is the same; you only need to drop the 3rd column in your bpe.codes and add a comment line at the top.
    • In your code you are not even using the last column values.
    • Your merges.txt can have the following as the first line: #version: 1 (look at the merges.txt file of RoBERTa).
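
A minimal sketch of that conversion (file names are illustrative; it simply drops the third, frequency column of bpe.codes and prepends the suggested header line):

# Hypothetical bpe.codes -> merges.txt conversion following the suggestion above.
with open("bpe.codes", encoding="utf-8") as codes, open("merges.txt", "w", encoding="utf-8") as merges:
    merges.write("#version: 1\n")  # top comment line, as suggested above
    for line in codes:
        pair = line.split()
        if len(pair) >= 2:
            merges.write(" ".join(pair[:2]) + "\n")  # keep the two merge tokens, drop the count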

@datquocnguyen
Contributor Author

datquocnguyen commented Sep 17, 2020

Hi @napsternxg, the model has already been uploaded to https://huggingface.co/vinai/bertweet-base. For now, you would have to install transformers from our development branch (as it has not been merged into the master branch of transformers yet). Did you try the following steps?

  • Python version >= 3.6
  • PyTorch version >= 1.4.0
  • Install transformers from our development branch:
    • git clone https://github.com/datquocnguyen/transformers.git
    • cd transformers
    • pip install --upgrade .
  • Install emoji: pip3 install emoji

Thanks for your suggestions. BertweetTokenizer is specifically designed to work on Tweet data, incorporating a Tweet tokenizer and normalizer, while PhobertTokenizer does not. Note that both our vocab.txt and bpe.codes files are also used when loading our models in fairseq, so I would prefer to keep them intact rather than converting them into another format.
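
For illustration, a rough sketch of that Tweet-specific preprocessing (hypothetical snippet; the normalization flag and the exact output are assumptions based on the diff and the example tweet shown elsewhere in this thread):

from transformers import BertweetTokenizer

tokenizer = BertweetTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)
# normalizeTweet maps user mentions, URLs and emoji to special tokens before BPE,
# e.g. producing something like "... HTTPURL via @USER :cry:".
print(tokenizer.normalizeTweet("SC has first two presumptive cases of coronavirus, DHEC confirms https://t.co/abc via @user 😢"))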

@datquocnguyen
Contributor Author

datquocnguyen commented Sep 17, 2020

Btw, I should mention that BERTweet has been accepted as an EMNLP-2020 demo paper, while PhoBERT has a slot in the Findings of EMNLP-2020 volume. Please help review this pull request so that others can benefit from using the models directly from the master branch of transformers. Thanks @LysandreJik @JetRunner @julien-c.
All checks have passed, and only the tokenizer files and their associated tests need to be reviewed.

@napsternxg

napsternxg commented Sep 17, 2020

Thanks, that makes sense.
@datquocnguyen I was trying to use it from the models website.
My suggestion on the bpe.codes file was not to remove it but to generate the merges.txt file from it, which would make it compatible with the Hugging Face tokenizer.

@datquocnguyen
Contributor Author

@napsternxg Please remove your "transformers" cache folder from ~/.cache/torch and reinstall transformers from our development branch. I am sure that BERTweet will work smoothly:

import torch
from transformers import AutoModel, AutoTokenizer

bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Models outputs are now tuples

@tienthanhdhcn

@datquocnguyen great work, and I am looking forward to seeing the PR get merged so that I can use the models directly from the Hugging Face transformers library.

Member

@LysandreJik LysandreJik left a comment

Ok I think this is great, I have nothing to add. LGTM, thanks for adding tests!

@LysandreJik
Member

Will merge today unless @julien-c, @JetRunner have comments.

@julien-c
Member

LGTM, do not hesitate to make the tokenizers as generic/configurable as possible, but this can be in a subsequent PR

@LysandreJik LysandreJik merged commit af2322c into huggingface:master Sep 18, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* Add BERTweet and PhoBERT models

* Update modeling_auto.py

Re-add `bart` to LM_MAPPING

* Update tokenization_auto.py

Re-add `from .configuration_mobilebert import MobileBertConfig`
not sure why it's replaced by `from transformers.configuration_mobilebert import MobileBertConfig`

* Add BERTweet and PhoBERT to pretrained_models.rst

* Update tokenization_auto.py

Remove BertweetTokenizer and PhobertTokenizer from tokenization_auto.py (they are currently not supported by AutoTokenizer).

* Update BertweetTokenizer - without nltk

* Update model card for BERTweet

* PhoBERT - with Auto mode - without import fastBPE

* PhoBERT - with Auto mode - without import fastBPE

* BERTweet - with Auto mode - without import fastBPE

* Add PhoBERT and BERTweet to TF modeling auto

* Improve Docstrings for PhobertTokenizer and BertweetTokenizer

* Update PhoBERT and BERTweet model cards

* Fixed a merge conflict in tokenization_auto

* Used black to reformat BERTweet- and PhoBERT-related files

* Used isort to reformat BERTweet- and PhoBERT-related files

* Reformatted BERTweet- and PhoBERT-related files based on flake8

* Updated test files

* Updated test files

* Updated tf test files

* Updated tf test files

* Updated tf test files

* Updated tf test files

* Update commits from huggingface

* Delete unnecessary files

* Add tokenizers to auto and init files

* Add test files for tokenizers

* Revised model cards

* Update save_vocabulary function in BertweetTokenizer and PhobertTokenizer and test files

* Revised test files

* Update orders of Phobert and Bertweet tokenizers in auto tokenization file
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
@thanhphi0401

Any news on this? When will PhoBERT be available on Hugging Face?

@LysandreJik
Member

It's been available since September:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

You can see the model card here.

@thanhphi0401

It's been available since September:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

model = AutoModelForMaskedLM.from_pretrained("vinai/phobert-base")

You can see the model card here.

But I don't see it listed here: https://huggingface.co/transformers/pretrained_models.html
How can I integrate it with Rasa NLU, sir?
Thank you

@LysandreJik
Member

PhoBERT is based on the RoBERTa implementation, so you can load it into a RobertaForMaskedLM model. The tokenizer is custom, so you should load it through PhobertTokenizer.
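
For example, a minimal sketch of that combination (checkpoint name as used earlier in this thread):

from transformers import PhobertTokenizer, RobertaForMaskedLM

tokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-base")
model = RobertaForMaskedLM.from_pretrained("vinai/phobert-base")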

I have never used Rasa NLU, so I can't help you much here. Your best option would be to open a thread on our forum with an example of how you do things for other models, so as not to flood this PR.

You can ping me on the thread (@Lysandre).

@datquocnguyen datquocnguyen mentioned this pull request May 4, 2022